Please let me know if you find mistakes (in particular, some characters might have disappeared during Markdown formatting). Bugfixers so far: Andreas (1).

1.1 Regular Expressions

Write regular expressions for (at least some of) the following. The valid/invalid examples are list of literal strings you might paste directly into your Java or Python (or similar) program for testing your regular expressions.

The language of Norwegian sheep. Sheep can say bæ, bæææ, bæbæbææææ, and so on, with at least one 'æ' following a 'b'. Examples:

Valid strings: "bæ", "bæbæ", "bææææbææææbæbæ", "bæbæbæbæbæ"

Invalid strings: "", "b", "bø", "ææææ", "æ", "bbæææ", "quack"
Traditional C comments. A C comment starts with /* and ends with */ and cannot be nested. Java/C++-style //-comments are not supported. Examples:

Valid strings: "/* foo */", "/** */", "/*/* /**/"

Invalid strings: "/*", "// foo", "/* /* */ */", "/*/"
String literals, according to the following rules (C-like): they begin and end with a double quote ("), and any double or single quote must be escaped by a preceding backslash (\). The backslash itself must also be escaped. For example, "foo", "fo\"o" and "fo\\" and "\\\"" are legal strings, and "ba"", "b\jar" are illegal.

Valid strings: "\"foo\"", "\"fo\\\"o\"", "\"\"" (i.e., "foo", "fo\"o", "" without extra quotes)

Invalid strings: "\"", "\"f\"o\"", "\"\\\\\\\"" (i.e., ", "f"o", "\\\" without extra quotes)
Add the following to the string literal definition: a string may contain \xNN or \uNNNN, where N is any hexadecimal digit (this would be used to insert arbitrary characters in hexadecimal ASCII or Unicode notation).
Integer constants: an optional + or -, followed by decimal digits.

Valid strings: "1234", "+1", "-0", "0"

Invalid strings: "+-0", "+", "a", "", "0.0"
Numeric constants in scientific notation, consisting of:
- an optional sign (+ or -)
- zero or more decimal digits
- a decimal point (.)
- zero or more decimal digits
- an exponent part, consting of
  - an upper or lowercase e
  - an optional sign (+ or -)
  - one or more decimal digits Addtional rules: the numbers should contain at least one digit either before or after the decimal point; and at least one of the decimal point or the exponent must be present.
Valid strings: "-2.0", "2e5", "314e-2", ".0", "1."

Invalid strings: ".", "1", "1e2e4", "1e2.0", "1.+10e2"
Internet names and IP addresses. Find the exact syntax rules online or make something up; in general, a host or domain name has one or more labels separated by dots (.), possibly ending with a dot. An IPv4 address in dotted-decimal notation contains four numbers from 0-255, separated by dots (.), and an IPv6 address contains:
- Eight groups of four hexadecimal digits, separated by colons (:)
- Leading zeros may be omitted from groups
- Multiple groups of zeros may be replaced by a double colon (::) – this is only legal once in an address

Extra Regular Expression Shorthands

The following shorthands are supported by many, but not all tools, and may be useful:

\s or [:space:] any space (incl tabs and newlines)
\S any non-space
\w any word-character (alphanumeric plus underscore)
\W any non-word-character
\d or [:digit:] a digit
\D a non-digit

The [:foo:] style is typically used by standard Unix tools, and only works inside character classes; e.g., [[:space:]]. Perl, Java, Python and many other languages support the \x convention.

1.2 Regular Expressions and Tools

Try your regular expressions from 1.1 on different tools (see below for examples).
Check that they actually work and match what you think they did.
Do you have to modify your expressions to make them fit the syntax of the tool?

Matching vs Searching

There's often a distinction between matching and searching. Matching a regular expression means checking that a string conforms to the grammar specified by the regular expression. Typically the match starts at the beginning of the string; the string may or may not be allowed to have trailing characters that are not matched. Searching will try to identify one (or more) substrings that match a regular expressions.

If you want to be sure your regular expression is matched against the entire string, insert a ^ (matches beginning of string) and $ (end of string) at the beginning/end of the regular expression. Whether 'beginning' or 'end' refers to a line or the entire string depends on your tool (e.g., grep will match lines; String.matches() will match the entire string by default). For example:

Starts with a comment: ^//

An empty string/line: ^$

A string/line of spaces: ^ *$

Regular Expressions in Java

You can match regular expressions in Java using the String.matches() method:

#!java
assertTrue("foo".matches("fo+"));

In more complicated cases, you'll want to use the Pattern class, which lets you precompile a regular expression (makes it faster!) and also lets you do more advanced things.

Java supports a wide range of regular expression operators.

Regular Expressions in Python

In Python, you'll find tools for regulare expressions in the re package:

#!python
import re
re.match("fo+", "f")

re.match() gives you a Match object on success, which you can use to find out more about the match (start / end of matching region, information about subgroups (things that are in parentheses in the expression) and so on). If there is no match, you get None.

For example, in Python 2.x, you might test a whole bunch of strings this way:

#!python
for s in ["a", "foo", "bææ"]:
     print s, re.match("fo+", s) != None

Python also has a rich regular expression language, similar to that found in Perl.

Regular Expressions in Unix Tools

You can use grep(1) to find matches in files:

#!bash
grep regexp file1...

Typically you'll have to enclose the regular expression in single quotes, or the shell will expand it and cause confusion.

This will find all (or at least most – it will be confused by comments and line breaks) productions in a Rascal grammar:

#!bash
grep '^[ \t]*syntax[ \t][ \t]*[A-Z][a-zA-Z0-9_]*' MyLang.rsc

Similarly, sed(1) will allow you to do simple text maniulations:

#!bash
sed -e 's/FOO/BAR/g' < input > output

Replaces all instances of FOO by BAR in input and writes the result to output.

Both sed and grep support a limited subset of regular expressions (e.g., they don't support +). The variant egrep(1) is better, but still somewhat different from what you find in Python, Java, Perl, etc.

1.3 Grammars

Consider this very simple grammar (in Rascal syntax notation:

start syntax Program = Expr;

syntax Expr
    = ID                  // variables
    | NUM                 // integers
    | Expr "+" Expr       // addition
    | Expr "*" Expr       // multiplication
    | ID "(" Expr ")"     // function call
    | "(" Expr ")"        // parentheses
    ;

The start symbol is Program, and we define a single non-termial Expr for expressions, with multiple alternatives (separated by |). In addition to the literal operator symbols, we have two terminals ID and NUM. Their definitions are:

// identifiers
//    y !<< x means 'x' must not be preceeded by  'y'
//    x !>> y means 'x' must not by followed by 'y'
// so, this means that an identifier is a sequence of one
// or more letters or underscores, with no additional
// letters or underscores before or after
lexical ID = [a-zA-Z_] !<< [a-zA-Z_]+ !>> [a-zA-Z_];

// numbers
lexical NUM = [0-9] !<< [0-9]+ !>> [0-9];

The interesting bits are [a-zA-Z_]+ and [0-9]+. The stuff before !<< and after !>> is just special Rascal notation for saying that there shouldn't be any leftover letters/digits before/after the identifier or number.

Write a few syntactically valid expressions. Make sure you use all the "features" of the language, including nested expressions.
Draw parse trees for your expressions.
Draw parse trees for the following expressions. Do any of the expressions have multiple parse trees? If so, draw all the trees.

    a + b * 2

    f(g(x))

    a + b + c

    (a + b) * 2

    (f((2)))

1.4 Parsing with Rascal

Do this if you have extra time, want to get a head start or don't have anything better to do in your weekend.

Follow the Rascal installation instructions.
With the Simple grammar still loaded in the console, try parsing all the example expressions you created. Things in the concrete object syntax (the grammar you've supplied, as opposed to the grammar for Rascal itself) are written in back-quotes (`a + b`). You can tell Rascal which non-terminal to expect by prefixing the concrete syntax fragment with the name of the non-termial in parentheses:

rascal>(Program)`x`
Program: `x`
Tree: appl(prod(sort("Program"),[sort("Expr")],{}),[appl(prod(sort("Expr"),[lex("ID")],{}),[appl(prod(lex("ID"),[conditional(iter(\char-class([range(65,90),range(95,95),range(97,122)])),{\not-precede(\char-class([range(65,90),range(95,95),range(97,122)])),\not-follow(\char-class([range(65,90),range(95,95),range(97,122)]))})],{}),[appl(regular(iter(\char-class([range(65,90),range(95,95),range(97,122)]))),[char(120)])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]])[@loc=|stdin:///|(10,1,<1,10>,<1,11>)]
rascal>(Expr)`x`
Expr: `x`
Tree: appl(prod(sort("Expr"),[lex("ID")],{}),[appl(prod(lex("ID"),[conditional(iter(\char-class([range(65,90),range(95,95),range(97,122)])),{\not-precede(\char-class([range(65,90),range(95,95),range(97,122)])),\not-follow(\char-class([range(65,90),range(95,95),range(97,122)]))})],{}),[appl(regular(iter(\char-class([range(65,90),range(95,95),range(97,122)]))),[char(120)])[@loc=|stdin:///|(7,1,<1,7>,<1,8>)]])[@loc=|stdin:///|(7,1,<1,7>,<1,8>)]])[@loc=|stdin:///|(7,1,<1,7>,<1,8>)]
rascal>(ID)`x`
ID: `x`
Tree: appl(prod(lex("ID"),[conditional(iter(\char-class([range(65,90),range(95,95),range(97,122)])),{\not-precede(\char-class([range(65,90),range(95,95),range(97,122)])),\not-follow(\char-class([range(65,90),range(95,95),range(97,122)]))})],{}),[appl(regular(iter(\char-class([range(65,90),range(95,95),range(97,122)]))),[char(120)])[@loc=|stdin:///|(5,1,<1,5>,<1,6>)]])[@loc=|stdin:///|(5,1,<1,5>,<1,6>)]

Try running unparse(...) on the result. You should get the input string back.

Wiki

inf225public / Exercise 1